New packages to install for this practicum:

install.packages("devtools")
library(devtools)
install_github("wilkelab/cowplot")

Good visualization is critical to present research findings in an intuitive and appealing way, but it can also enhance our ability to understand data as we analyze it and uncover unexpected patterns. In this practicum, we will learn more about the ggplot2 package, which has greatly simplified the production of high quality graphics in R. The official ggplot2 documentation is helpful to give an idea of how much this package can do. You can also read more in this overview and/or the data visualization chapter of the “R for Data Science” book. Samples of plots that can be made with ggplot2 with accompanying code are available at sites such as UBC Stat shiny server. We will use (1) iris, a classic R built-in dataset that contains measures of sepal length and width, and petal length and width for 50 flowers from each of 3 species of iris (setosa, versicolor, virginica), and (2) the NHANES dataset we’ve used before.

library(dplyr)
library(ggplot2)
nhanes <- read.csv(url("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008.csv"), header=TRUE)
nhanes <- rename(nhanes, id=SEQN, gender=RIAGENDR, age=RIDAGEYR, race=RIDRETH1, education=DMDEDUC2, income=INDHHIN2, health.provider=HUQ040, wheezing=RDQ070, asthma=MCQ010, voc=WTSVOC2Y, fvc=SPXNFVC, fev1=SPXNFEV1) %>%
    mutate(gender=factor(gender, levels=c(1, 2), labels=c("male", "female"))) %>%
    mutate(race=factor(race, levels=c(3, 1, 2, 4, 5), labels=c("white", "MexicanAmerican", "Hispanic", "black", "other"))) %>%
    mutate(asthma=ifelse(asthma %in% c(7,9), NA, ifelse(asthma==2, 0, 1))) %>%
    filter(!is.na(asthma))
str(nhanes)
## 'data.frame':    9659 obs. of  12 variables:
##  $ id             : int  41475 41476 41477 41478 41479 41480 41481 41482 41483 41485 ...
##  $ gender         : Factor w/ 2 levels "male","female": 2 2 1 2 1 1 1 1 1 2 ...
##  $ age            : int  62 6 71 1 52 6 21 64 66 30 ...
##  $ race           : Factor w/ 5 levels "white","MexicanAmerican",..: 5 5 1 1 2 2 4 2 4 3 ...
##  $ education      : int  3 NA 3 NA 1 NA 3 2 4 2 ...
##  $ income         : int  6 15 5 3 8 7 6 15 5 6 ...
##  $ health.provider: int  1 2 1 2 2 1 1 1 2 NA ...
##  $ wheezing       : int  1 2 2 2 2 2 2 2 2 2 ...
##  $ asthma         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ voc            : num  123988 NA 20915 NA NA ...
##  $ fvc            : int  2473 1742 NA NA 3545 1836 4928 3907 3170 NA ...
##  $ fev1           : int  2025 1551 NA NA 2957 1691 4541 2920 2355 NA ...
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Pipe operator works with ggplot

nhanes %>% 
    mutate(adult.status = factor(age>=18, labels=c("child", "adult"))) %>% 
    filter(adult.status=="child") %>% 
    ggplot(aes(x=age, y=fev1)) + 
        geom_point() +
        geom_smooth(method = "lm", color="red")

More ggplot options and plot types

We’ll look at more options related to ggplot by adding to the layers created before. More options related to setting background, axis, title, and legend options are available in theme. The default settings for ggplot will appropriately select color type (although not necessary color blind or black-and-white printer friendly).

Scatterplot

ggplot(data=iris, aes(Sepal.Length, Petal.Length, color=Petal.Width)) + 
    geom_point() + 
    labs(title="Iris Characteristics") +
    labs(x="Sepal Length", y="Petal Length") + #Adds a layer with labels
    xlim(c(4, 8)) + #Changes x axis limits
    ylim(c(1, 8))   #Changes y axis limits

Change color scale.

ggplot(data=iris, aes(Sepal.Length, Petal.Length, color=Petal.Width)) + 
    geom_point() + 
    scale_color_gradient(low="#fee6ce", high="#e6550d") +
    labs(title="Iris Characteristics") +
    labs(x="Sepal Length", y="Petal Length") + #Adds a layer with labels
    xlim(c(4, 8)) + #Changes x axis limits
    ylim(c(1, 8)) + #Changes y axis limits
    theme_bw() #Changes theme to black and white

This time we can view up to four dimensions of our data by setting color and size, in addition to x and y variables.

ggplot(data=iris, aes(Sepal.Length, Petal.Length, color=Species, size=Petal.Width)) + 
    geom_point() + 
    labs(title="Iris Characteristics") +
    labs(x="Sepal Length", y="Petal Length") + #Adds a layer with labels
    xlim(c(4, 8)) + #Changes x axis limits
    ylim(c(1, 8)) + #Changes y axis limits
    theme_bw() #Changes theme to black and white

Barplot

When creating barplots, use geom_bar with default option stat=bin to get counts in each category. To split panels according to another categorical variable facet_grid can be used. More color options that work for all chart types are described here.

ggplot(data=nhanes, aes(x=race, fill=factor(asthma))) + 
    geom_bar(position="dodge")

ggplot(data=nhanes, aes(x=race, fill=factor(asthma))) + 
    geom_bar(position="dodge") +
    facet_grid(. ~gender) + #Split by another variable
    theme(axis.text.x=element_text(angle=45, hjust=1))

ggplot(data=nhanes, aes(x=gender, fill=factor(asthma))) + 
    geom_bar() +
    scale_fill_brewer(palette="Set1") #ColorBrewer palettes

Boxplot

ggplot(data=iris, aes(Species, Sepal.Length)) + 
    geom_boxplot() 

ggplot(data=iris, aes(Species, Sepal.Length)) +
    geom_boxplot() +
    geom_jitter() #Add a second layer of jittered data points

ggplot(data=iris, aes(Species, Sepal.Length)) +
    geom_boxplot(outlier.colour="red", outlier.size=4) #Making outliers more noticeable

ggplot(data=iris, aes(Species, Sepal.Length)) +
    geom_violin() #violin plots

ggplot(data=iris, aes(Species, Sepal.Length)) +
    geom_violin(fill="lightblue", draw_quantiles = c(0.25, 0.5, 0.75)) +
    geom_jitter(height=0, width=0.1)

ggplot(data=iris, aes(Species, Sepal.Length)) +
    geom_boxplot(aes(fill=Species)) #Coloring according to species

ggplot(data=nhanes, aes(gender, voc)) +
    geom_boxplot(color="darkgreen", fill="darkgreen", alpha=0.5)

ggplot(data=nhanes, aes(gender, voc)) +
    geom_boxplot(aes(fill=factor(asthma))) #Add another variable with fill

Histograms

ggplot(data=nhanes, aes(age)) + 
    geom_histogram(breaks=seq(0, 80, 1), color="blue", fill="blue", alpha=0.7) 

#Density plot
ggplot(data=nhanes, aes(age)) + 
    geom_histogram(aes(y=..density..), breaks=seq(0, 80, 1), color="blue", fill="blue", alpha=0.7) +
    geom_density(color="red")

Smoothing lines

When many points are part of a scatter plot, it can be difficult to see patterns that represent major trends. The geom_smooth command (similar to geom_stat) will plot smooth conditional means. The smoothing function can be specified with the method option as “lm”, “glm”, “gam”, “loess”, “nls”.

ggplot(nhanes, aes(x=age, y=fev1)) +
    geom_point(color="grey") +
    geom_smooth(color="red") #Default option is method="auto" which selects loess for <1000 points and gam with formula y~s(x, bs="cs") otherwise

ggplot(nhanes, aes(x=age, y=fev1)) +
    geom_point(color="grey") +
    geom_smooth(color="red") +
    geom_smooth(color="black", method="lm") +
    geom_smooth(color="green", method="loess")

Additionally, a formula can be given for use in the smoothing function.

x <- 1:10
y <- jitter(x^2)
DF <- data.frame(x, y)

ggplot(DF, aes(x=x, y=y)) + 
    geom_point() +
    geom_smooth(method='lm', aes(color = 'linear'), se=FALSE) +
    geom_smooth(method='lm', formula = y ~ poly(x, 2), aes(color = 'polynomial'), se=FALSE) +
    geom_smooth(method='nls', 
                formula = y ~ a*log(x) + b, 
                method.args = list(start = c(a=1, b=1)),
                aes(color = 'logarithmic'), 
                se=FALSE) +
    geom_smooth(method='nls', 
                formula = y ~ a*exp(b*x), 
                method.args = list(start = c(a=1, b=1)),
                aes(color = 'Exponential'), 
                se=FALSE)

Contours of 2D density estimates

When many points are in a 2D scatter plot, it can similarly be difficult to tell where the majority of points are. Density plots can help us interpret data relationships when there is overplotting.

ggplot(nhanes, aes(x=fvc/1000, y=fev1/1000)) +
    geom_point(color="grey") +
    stat_density2d(aes(fill=..level.., alpha=..level..), geom="polygon", color="black") +
    scale_fill_continuous() +
    guides(alpha="none")

Raster of 2D density estimates

If the contours are left out, then stat_density2d can be used to produce a raster of the data.

ggplot(nhanes, aes(x=fvc/1000, y=fev1/1000)) +
    stat_density2d(aes(fill=..density..), geom="raster", contour=FALSE) +
    scale_fill_continuous(low="#238443", high="#ffffcc") +
    guides(alpha="none")

viridis color scheme

This color scheme is a personal favorite for use with heatmaps and other 2D plots that, as of 2018, is part of ggplot2.

ggplot(nhanes, aes(x=fvc/1000, y=fev1/1000)) +
    stat_density2d(aes(fill=..density..), geom="raster", contour=FALSE) +
    scale_fill_viridis_c() +
    guides(alpha="none")

Saving figures

To save graphical output to a file in R, one opens a graphics device to redirect output of plots to them, and when plots are completed, one closes the device to save the final version. To see all available graphical devices, type ?Devices. The default device in R is usually an interactive window, like the one we see in RStudio. Here are two examples using common formats.

options("device") #To see the current graphics device
## $device
## function (width = 7, height = 7, ...) 
## {
##     grDevices::pdf(NULL, width, height, ...)
## }
## <bytecode: 0x7faf09457d78>
## <environment: namespace:knitr>
fig <- ggplot(data=iris, aes(Species, Sepal.Length)) +
        geom_boxplot(color="darkblue", fill="blue")
#Save as PDF
pdf("Ex1.pdf", width=6, height=5) #Units are inches
fig
dev.off()
## quartz_off_screen 
##                 2
#Save as TIFF
tiff("Ex1.tiff", width=480, height=480, units="px")
fig
dev.off()
## quartz_off_screen 
##                 2

ggplot2 offers a simplified function to save graphics output: ggsave.

fig

ggsave("Ex2.pdf") #Default plot to save is last plot
ggsave("Ex2.tiff", plot=fig)

Publication graphics

The theme used with ggplot2 can be customized to suit your personal taste. One researcher created an R package called cowplot as a ggplot2 add-on to easily create publication-ready graphics.

library(cowplot)
save_plot("Ex3.pdf", fig, base_aspect_ratio=1.3) #Save using the default format of cowplot